#Loading the packages and data as shown on the previous guide "Introduction to R". 

packages.to.load <- c("plyr", "dplyr" ,"DT", "ggplot2", "plotly", "RODBC")

invisible(lapply(packages.to.load, library, character.only=TRUE))


load("C:/Users/shregmi/Documents/Current Projects/R Training/Example Datasets/Auto.rda")

write.csv(Auto, file="Autodata.csv")

auto.data <- read.csv(file ="C:/Users/shregmi/Documents/Current Projects/R Training/Autodata.csv", header = TRUE)

1 Manipulating Data

This section will cover a couple ways to shape data into something that is easy to work with.


1.1 Using the $ operator

For very simple filters and selection of certain columns from a data frame, you can use the $ operator. Here, we take one column from our auto.data data frame and assign it to a new object. You can use the class() function to see what kind of object it is.

JustMPG <- matrix(auto.data$mpg)

class(JustMPG)
## [1] "matrix"

We can also pull multiple columns pretty easily.

MPGandWeight <- data.frame(auto.data$mpg, auto.data$weight)
head(MPGandWeight)
##   auto.data.mpg auto.data.weight
## 1            18             3504
## 2            15             3693
## 3            18             3436
## 4            16             3433
## 5            17             3449
## 6            15             4341

1.2 dplyr

The package dplyr is a heavily used library for data cleaning and preparation (one of R’s main strong suits). Here is a little cheat sheet covering some dplyr and tidyr features: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

There are 5 main functions or “verbs” used in dplyr: Select, Filter, Mutate, Arrange, and Summarize. Many of these align with the main SQL commands: Select, Where, Group By, etc…

1.2.1 Select

Let’s take a subset of columns from the auto.data dataset.

auto.subset <- select(auto.data, name, mpg, cylinders, weight)

head(auto.subset)
##                        name mpg cylinders weight
## 1 chevrolet chevelle malibu  18         8   3504
## 2         buick skylark 320  15         8   3693
## 3        plymouth satellite  18         8   3436
## 4             amc rebel sst  16         8   3433
## 5               ford torino  17         8   3449
## 6          ford galaxie 500  15         8   4341

We can also specify which columns to exclude using the “-” operator.

auto.subset2 <- select(auto.data, -horsepower, -year)
head(auto.subset2)
##   X mpg cylinders displacement weight acceleration origin
## 1 1  18         8          307   3504         12.0      1
## 2 2  15         8          350   3693         11.5      1
## 3 3  18         8          318   3436         11.0      1
## 4 4  16         8          304   3433         12.0      1
## 5 5  17         8          302   3449         10.5      1
## 6 6  15         8          429   4341         10.0      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

1.2.2 Filter

After selecting your data of interest, you can pass filters in the same way as you would a “where” clause in SQL.

only.4cyl <- filter(auto.subset, cylinders==4)
## Warning: package 'bindrcpp' was built under R version 3.3.3
head(only.4cyl)
##                           name mpg cylinders weight
## 1        toyota corona mark ii  24         4   2372
## 2                 datsun pl510  27         4   2130
## 3 volkswagen 1131 deluxe sedan  26         4   1835
## 4                  peugeot 504  25         4   2672
## 5                  audi 100 ls  24         4   2430
## 6                     saab 99e  25         4   2375

You can add multiple arguments to the filter function.

multi.filter <- filter(auto.subset, cylinders %in% c(4,6), weight<3000)
head(multi.filter)
##                           name mpg cylinders weight
## 1        toyota corona mark ii  24         4   2372
## 2              plymouth duster  22         6   2833
## 3                   amc hornet  18         6   2774
## 4                ford maverick  21         6   2587
## 5                 datsun pl510  27         4   2130
## 6 volkswagen 1131 deluxe sedan  26         4   1835

We can filter strings as well.

filter.name <- filter(auto.subset, cylinders==4, weight<2000, grepl("toyota|volkswagen", name))
head(filter.name)
##                           name mpg cylinders weight
## 1 volkswagen 1131 deluxe sedan  26         4   1835
## 2          toyota corolla 1200  31         4   1773
## 3         volkswagen model 111  27         4   1834
## 4      volkswagen super beetle  26         4   1950
## 5          toyota corolla 1200  32         4   1836
## 6            volkswagen dasher  26         4   1963

1.2.3 Mutate

Using mutate, we can create new columns. Additionally, we can chain the previous “verbs” together using the “%>%” operator (called pipe operator). Note that the dplyr package must be loaded in order to use this operator.

add.col <- auto.data %>%
  select(name, mpg, cylinders, weight) %>%
  filter(weight>1800) %>%
  mutate(weightpercyl <- weight/cylinders)

head(add.col)
##                        name mpg cylinders weight
## 1 chevrolet chevelle malibu  18         8   3504
## 2         buick skylark 320  15         8   3693
## 3        plymouth satellite  18         8   3436
## 4             amc rebel sst  16         8   3433
## 5               ford torino  17         8   3449
## 6          ford galaxie 500  15         8   4341
##   weightpercyl <- weight/cylinders
## 1                          438.000
## 2                          461.625
## 3                          429.500
## 4                          429.125
## 5                          431.125
## 6                          542.625

1.2.4 Arrange

This is equivalent to an “order by” statement in SQL. Here, we sort the previous output by “weightpercyl” in descending order using the “-” operator.

add.col.sorted <- auto.data %>%
  select(name, mpg, cylinders, weight) %>%
  filter(weight>1800) %>%
  mutate(weightpercyl = weight/cylinders) %>%
  arrange(-weightpercyl)

head(add.col.sorted)
##                        name  mpg cylinders weight weightpercyl
## 1                mazda rx-4 21.5         3   2720     906.6667
## 2               peugeot 504 19.0         4   3270     817.5000
## 3        mercedes-benz 240d 30.0         4   3250     812.5000
## 4 peugeot 505s turbo diesel 28.1         4   3230     807.5000
## 5             mazda rx-7 gs 23.7         3   2420     806.6667
## 6               peugeot 504 27.2         4   3190     797.5000

1.2.5 Summarize

summarized.auto <- auto.data %>%
  select(name, mpg, cylinders, weight) %>%
  filter(weight>1800) %>%
  mutate(weightpercyl = weight/cylinders) %>%
  summarise(avg_mpg = mean(mpg),
            min_weight = min(weight),
            median_weightpercyl = median(weightpercyl))

summarized.auto
##    avg_mpg min_weight median_weightpercyl
## 1 23.18903       1825            544.1667

We can pass a group_by() clause as well.

summarized.auto2 <- auto.data %>%
  select(name, mpg, cylinders, weight) %>%
  filter(weight>1800) %>%
  mutate(weightpercyl = weight/cylinders) %>%
  group_by(cylinders) %>%
  summarise(avg_mpg = mean(mpg),
            avg_weight = mean(weight),
            avg_weightpercyl = mean(weightpercyl))

summarized.auto2
## # A tibble: 5 x 4
##   cylinders avg_mpg avg_weight avg_weightpercyl
##       <int>   <dbl>      <dbl>            <dbl>
## 1         3    20.6      2398.             800.
## 2         4    29.0      2331.             583.
## 3         5    27.4      3103.             621.
## 4         6    20.0      3202.             534.
## 5         8    15.0      4115.             514.

2 Visualization

This section will cover a few useful packages and methods for visualization in an R Markdown document.


2.1 Displaying Tables in R Markdown

When using print() or head() functions to show the contents of a data frame, the output looks very ugly. There are a huge number of packages that are specifically for displaying tables in an aesthetically pleasing way. The “DT” or Data Tables package is one of them.

datatable(auto.data)

This is the default way to display a data frame with this package with no arguments provided except for the object that is to be shown.

There are a huge number of arguments you can add to make your reports look nicer and to add functionality. BONUS: You’ll also write the smallest piece of JavaScript code in this as well.

datatable(auto.data, extensions = 'Buttons', options = list(
  dom = 'Bfrtip', 
  buttons= c('copy', 'excel'),
  pageLength = 5,
  initComplete = JS("
    function(settings, json) {
      $(this.api().table().header()).css({
        'background-color': '#232D69',
        'color': '#fff'
      });
    }")
), rownames = FALSE)

2.2 Charts with Plotly

Cheat sheet for plotly: https://images.plot.ly/plotly-documentation/images/r_cheat_sheet.pdf.

Another great resource: https://plot.ly/r/

plot <- plot_ly(x = auto.data$weight, type="histogram")
plot

In plotly, you can add multiple sets of traces or bars to the same plot with the use of pipe operators. You can also name each of the traces/bars so that the result is clear.

plot2 <- plot_ly(data =auto.data, alpha = 0.5) %>% #adjusting color transparency 
  add_histogram(x = auto.data$weight[auto.data$origin == 1], name="American") %>%
  add_histogram(x = auto.data$weight[auto.data$origin == 2], name="European") %>%
  add_histogram(x = auto.data$weight[auto.data$origin == 3], name="Japanese") %>%
  layout(barmode = "overlay")

plot2

Here is an example of a scatterplot generated by plotly. This has many more features and much more functionality than the plot we first generated with the R Base plot() function.

auto.data$OriginName[which(auto.data$origin == 1)]  = "American"
auto.data$OriginName[which(auto.data$origin == 2)]  = "European"
auto.data$OriginName[which(auto.data$origin == 3)]  = "Japanese"


plot3 <- plot_ly(data = auto.data, x = ~weight, y = ~mpg, color = ~OriginName, type ="scatter", mode="markers", text= ~paste("Car name: ", name ,
                                                                                                                                "</br> Year: ", year,
                                                                                                                                "</br> Cylinders: ", cylinders))

plot3

***

2.3 Charts with ggplot2

Cheat sheet for ggplot2: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf